Synthetic Ontologies: A Hypothesis 


Towards More Flexible and Adaptable Knowledge Management Systems 


Introduction 


As we move further into the digital age, data is 
becoming an increasingly valuable commodity that 
drives innovation, research, and development across 
all industries. The ability to extract insights from data is 
heavily reliant on the quality and structure of 
knowledge graphs, which in turn depends on the 
accuracy and efficiency of natural language processing 
algorithms. While pre-defined ontologies have been a 
useful tool for organizing data, they are limited in their 
ability to capture the complexity and nuance of 
relationships between concepts. This essay explores the
possibility of developing alternative approaches to
ontology engineering that yield more flexible and
adaptable knowledge graphs. In particular, this work
outlines the challenges of unifying data exchange 
standards and proposes the use of completely 
synthetic, i.e. purely machine generated, ontologies as 
a potential solution. 


Problem Definition 


Data Exchange Standards as Key Value Drivers 
The emergence of the Internet age can be attributed to 
multiple factors. As Andrew Russell notes in his book 
"Open Standards and the Digital Age," the 
development of communication standards stands out 
as a crucial catalyst. Much of the advancement can be 
traced back to the development of graph theory, which 
has facilitated the creation of highly efficient search 
algorithms, social networks, and recommendation 
systems that harness the power of graph analysis. 


In the early 1990s, John Barlow's seminal work
"Cybernomics" laid the groundwork for the information
economy by outlining its key implications. The one that
especially stands out is that “relationships replace
things”: a relationship is represented by an
“active flow of information, the greater the flow the
more valuable is the relationship.” The other key
observation is that “transaction becomes continuous.”
Three decades after the publication of “Cybernomics,”
the first prediction is visible in the rise of social
networks. The other prediction, about continuous
transactions, manifests itself in a multitude of
subscription models, especially software as a service,
which seems to be the most popular business model in
the IT industry. Thus, the flow of data is crucial for
the economy and the creation of value.


In the field of economics, the creation of value is 
attributed to labour, capital, and total factor 
productivity. The latter measures the overall efficiency 
of an activity. As we live in the information economy 
where the value of services prevails over the value of 
products, the total factor productivity plays a key role. 
At the same time, analysts, data scientists, consultants,
etc. are well aware that roughly 80% of their time is
spent transforming data into a format that is amenable
to analysis. Thus, there is a growing need to address
the multitude of standards used for data exchange and
to reduce the resulting inefficiencies. It is worth noting
that, according to some estimates, around one-third of
the data being generated worldwide daily is medical
data, making it a strong candidate for trying out new
approaches to data processing. By streamlining data
exchange standards, we can unlock value in previously
untapped areas of knowledge.


Future Shock and Language Singularity 

Before the term “singularity” became a household
word, a common term was “future shock,” coined by
Alvin Toffler in the book of the same title. Back in 1970,
he observed an ever-increasing acceleration of change
and information overload. Today the future shock has
finally caught up with our language. The number of
item sub-categories that get manufactured or provided
as a service per unit of language increases; therefore,
as the usage of any word becomes more frequent, its
meaning collapses to the most frequent association, not
necessarily reflecting its “yesterday’s original
meaning.” As a result, unification of standards is a
challenge in many industries. The recent developments
with large language models (LLMs) make this problem
especially difficult in knowledge management. It can be
argued that LLMs will produce a lot of noise and
exacerbate the biases present in the data on which
they were trained, distorting the language further.


The environment of current data formats can be
described as “VUCA,” an acronym that stands
for Volatility, Uncertainty, Complexity, and Ambiguity.
The concept originated in 1987, based on the
leadership theories of Warren Bennis and Burt Nanus,
was quickly adopted in the field of military strategy,
and has since been applied to various contexts. A VUCA
world is one that is characterized by rapid and
unpredictable change, where traditional approaches to
problem-solving and decision-making may no longer be
effective. Organizations and individuals must be
adaptable, resilient, and able to navigate complexity
and ambiguity in order to succeed. The concept of
VUCA is often used to describe the current state of the
global economy and the challenges facing
organizations in the 21st century.


Thus, the problem can be distilled to finding a way
to develop ontologies extremely quickly, at a time when
new scientific discoveries make yesterday’s knowledge
obsolete in a matter of months, if not weeks.


Problem Analysis 


Two Key Problems with the Language Itself 

First of all, the nature of language is different from
the nature of reality. According to the work of Vasilii
Nalimov titled “In the Labyrinths of Language: A
Mathematician's Journey,” the key characteristic of any
language is its discrete nature. At the same time, the
author points out the continuous nature of reality,
which creates some unresolvable issues with the use of
a language to describe the world around us.


Secondly, language, viewed as a formal logical structure
with a closed set of rules, is either internally
inconsistent or incomplete. Anyone familiar with the
incompleteness theorems of the 20th-century logician Kurt
Gödel understands that undecidable statements or
contradictions are inevitable in any sufficiently
expressive closed logical system.


Ontologies as a Meta-Language

Ontologies came out of the branch of philosophy 
known as metaphysics, which deals with the nature of 
reality and existence. Since the 1970s, researchers
have recognized the importance of
ontologies for knowledge engineering and for building
powerful AI algorithms. Wikipedia defines ontology in
the field of computer science and information science 
as “a representation, formal naming, and definition of 
the categories, properties, and relations between the 
concepts, data, and entities that substantiate one, 
many, or all domains of discourse.” 


To spare the reader the mathematical jargon, let’s
dive right into the relation between ontologies and
knowledge graphs by resorting to a diagram. To analyse
the problem, the author adopted the approach
documented by Huang (2022) in the paper “Causal
Relationships over Knowledge Graphs.” The figure
below is a concept map that shows ontology from the
point of view of computer science.


Common Methods of Ontology Engineering 
A lengthy discussion of this matter lies beyond the
scope of this work, whose goal is to build a
case for the hypothesis proposed in the next
section. However, it is worth mentioning that the
contemporary literature provides a broad range of
approaches to ontology engineering. One of several
ways to classify them is into “top-down” and
“bottom-up” approaches. While the top-down
methodology emphasizes the reuse of existing domain
ontologies and upper-level ontologies, the bottom-up
methodology is used to specifically reveal the “possible
semantics of the data” on which the ontology is
constructed (Lopez-Pellicer et al., 2007, “Using a hybrid
approach for the development of an ontology in the
hydrographical domain”).


Proposed Solution 


Bottom-Up Approach to Ontology Construction 
Those specializing in qualitative data analysis are 
familiar with the ‘grounded theory method’ — a non- 
axiomatic method of theory construction. The gist of 
the method is to use underlying data to determine 
patterns, draw hypotheses, and test them. Thus, we 
can hypothesize a complete model of the underlying 
data in the form of a meta-language — an ontology, 
and then test its validity. 


In plain terms, the grounded theory method is a 
bottom-up approach in which relationships between 
concepts are inferred based on the relationships that 
are present in the data. This approach allows for the 
creation of domain-specific knowledge graphs that are 
tailored to the specific needs of the application. By 
starting from the data and inferring relationships based 
on patterns that emerge, the resulting knowledge 
graph can capture the complexity and nuance of the 
relationships between concepts that may not be 
captured by ontologies defined using alternative 
approaches. 
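
As a rough illustration of the bottom-up idea, the sketch
below counts how often pairs of known terms appear in the
same sentence and proposes the frequent pairs as candidate
relations. Plain co-occurrence counting is an assumption
made here for illustration only, not the method proposed
in this essay; the function name `candidate_relations`,
the sample sentences, and the threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def candidate_relations(sentences, terms, min_count=2):
    """Propose term pairs that co-occur in at least `min_count` sentences.

    Hypothetical helper: co-occurrence counting stands in for whatever
    pattern-finding step a real bottom-up pipeline would use.
    """
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        # Terms found in this sentence, sorted so each pair has one canonical order.
        present = sorted(t for t in terms if t.lower() in text)
        counts.update(combinations(present, 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Illustrative data only.
sentences = [
    "Insulin lowers blood glucose.",
    "Low insulin leads to high glucose.",
    "Glucose is stored in the liver.",
]
relations = candidate_relations(sentences, ["insulin", "glucose", "liver"])
```

The candidate pairs would still have to be named, typed,
and tested, which is the part the essay suggests
delegating to an LLM and a human reviewer.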


The benefits of the bottom-up approach are significant.
This approach allows for the creation of more
comprehensive and accurate knowledge graphs that
can be personalized to the specific needs of individual
users. It can capture more nuanced and complex
relationships between concepts, and the graphs can be
updated and refined over time as new data becomes
available, making them more accurate and up to date.
Additionally, the approach is scalable and adaptable to
new contexts and data, making it well suited for
applications in knowledge domains where change
is especially fast. Bottom-up approaches are
perhaps more labour intensive; however, the hard part
can be delegated to LLMs given the right set of
prompts.


A Synthetic Ontology 
The proposed solution rests on the
following key processes:

* Storage of context, such as text or graphics,
  along with a term from an ontology.
* Automatic generation of ontologies based on the
  context and the right prompt for an LLM.
* Human verification test, which can be
  semi-automated.
* Test of internal consistency.
* A version control system.

Thus, the solution lies in having an ontology generated
by an LLM. How to do that is the subject of ongoing
and future research.
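
Two of the processes listed above, storing a term with
its context and keeping versions, can be sketched as a
small data structure. This is a minimal sketch under
stated assumptions, not a design the essay prescribes:
the names `TermRecord` and `OntologyStore`, the
`verified` flag, and the use of a content hash as a
version id are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class TermRecord:
    """A term stored together with the context it was derived from."""
    term: str
    context: str            # e.g. the source passage, or a reference to a figure
    verified: bool = False  # set once a human reviewer has accepted the term


@dataclass
class OntologyStore:
    """Hypothetical append-only store: every commit adds a new version."""
    versions: list = field(default_factory=list)

    def commit(self, records):
        # Serialize deterministically so identical content yields an identical id.
        payload = json.dumps([asdict(r) for r in records], sort_keys=True)
        version_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
        self.versions.append((version_id, tuple(records)))
        return version_id


store = OntologyStore()
context = "Ecology studies organisms and their environment."
v1 = store.commit([TermRecord("Ecology", context)])
v2 = store.commit([TermRecord("Ecology", context, verified=True)])
```

Here the version id changes as soon as a reviewer marks a
term as verified, so the history records both what the
model generated and what a human accepted.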


Preliminary Results 
Model GPT-NeoX-20B: The model shows
basic/rudimentary internal consistency given prompts
of sufficient length. It can generate general
statements about ontologies and very rudimentary,
very general high-level conceptual structures.
However, it fails to generate statements in JSON
format, as prompted.


Model GPT-3.5: The model provides sufficiently
well-formed answers in JSON format as prompted, correctly
specifying ontologies. Please see one example below.
One can easily imagine how it might be possible to
generate an ontology having processed a corpus of
texts describing some knowledge domain. An
interesting artefact is that despite being an
autoregressive LLM, this model can make
“forward-looking” links to text it has not yet generated.
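
Because one model did not return JSON as prompted, any
automated use of such output would presumably need a
check of the response before accepting it. The sketch
below is one possible check, assuming the response shape
of the biology example in this essay (`taxonomy`,
`categories`, and per-category `id`, `name`, `children`);
the function name `parse_taxonomy` is hypothetical.

```python
import json


def parse_taxonomy(response_text):
    """Return the taxonomy dict embedded in an LLM response, or None.

    Assumed shape:
    {"taxonomy": str, "categories": [{"id", "name", "children"}, ...]}
    """
    # Models often wrap JSON in prose, so try the outermost {...} span.
    start, end = response_text.find("{"), response_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(response_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    categories = data.get("categories")
    if not isinstance(data.get("taxonomy"), str) or not isinstance(categories, list):
        return None
    required = {"id", "name", "children"}
    if not all(isinstance(c, dict) and required <= c.keys() for c in categories):
        return None
    return data
```

A response that fails the check can be discarded or the
prompt retried, which keeps malformed output out of the
stored ontology.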


An example of the Taxonomical Part of an Ontology for the Field of Biology 


Prompt: 
Build a taxonomy for the domain of biology that 
consists of 10 broad categories in JSON format. 


Response: 

Sure! Here's an example of a taxonomy for the 
domain of biology, consisting of 10 broad 
categories in JSON format: 


{
  "taxonomy": "Biology",
  "categories": [
    {
      "id": "01",
      "name": "Ecology",
      "children": []
    },
    {
      "id": "02",
      "name": "Cellular Biology",
      "children": []
    },
    {
      "id": "03",
      "name": "Molecular Biology",
      "children": []
    },
    {
      "id": "04",
      "name": "Genetics",
      "children": []
    },
    {
      "id": "05",
      "name": "Microbiology",
      "children": []
    },
    {
      "id": "06",
      "name": "Physiology",
      "children": []
    },
    {
      "id": "07",
      "name": "Anatomy",
      "children": []
    },
    {
      "id": "08",
      "name": "Botany",
      "children": []
    },
    {
      "id": "09",
      "name": "Zoology",
      "children": []
    },
    {
      "id": "10",
      "name": "Evolution",
      "children": []
    }
  ]
}

In this example, the taxonomy is named
"Biology" and consists of 10 broad categories:

Ecology
Cellular Biology
Molecular Biology
Genetics
Microbiology
Physiology
Anatomy
Botany
Zoology
Evolution

Each category has an ID, a name, and an empty
list of children. The taxonomy can be expanded
by adding more subcategories under each of the
10 broad categories, allowing for more
detailed classification of biological
concepts.


In this example, an LLM generated the first-level
taxonomy. Given the right prompt, it can add relations.
Obviously, the model is limited by the size of its
“working memory”; however, this limitation can be
partly circumvented by prompting the model to provide
the taxonomy and the relations separately. Likewise, each
section can be expanded by providing specially
prepared prompts.
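
Expanding the taxonomy one section at a time can be
sketched as two small steps: build a follow-up prompt for
a single category, then merge the returned subcategories
into the structure. This is only a sketch; the prompt
wording, the function names, and the dotted child id
scheme (for example "04.01") are assumptions, and no
model is called here.

```python
def expansion_prompt(domain, category, n=5):
    """Build a follow-up prompt asking for subcategories of one category."""
    return (
        f"For the domain of {domain}, list {n} subcategories of "
        f'"{category}" as a JSON array of strings.'
    )


def attach_children(taxonomy, category_name, child_names):
    """Add children to one category, extending the parent's id (assumed scheme)."""
    for category in taxonomy["categories"]:
        if category["name"] == category_name:
            # Continue numbering after any children that already exist.
            start = len(category["children"]) + 1
            for i, name in enumerate(child_names, start=start):
                category["children"].append(
                    {"id": f'{category["id"]}.{i:02d}', "name": name, "children": []}
                )
            return taxonomy
    raise KeyError(category_name)


taxonomy = {
    "taxonomy": "Biology",
    "categories": [{"id": "04", "name": "Genetics", "children": []}],
}
prompt = expansion_prompt("biology", "Genetics", n=2)
# The list below stands in for a parsed model reply.
attach_children(taxonomy, "Genetics", ["Population Genetics", "Epigenetics"])
```

Keeping each request scoped to one category keeps every
prompt and reply small, which is the point of working
around the model's limited context.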


In the following example, even though it is not 
technically an ontology, you can see how concepts and 
their relations are listed in the same JSON output. 


Given a sufficient number of carefully engineered
prompts, it might be possible to extract the structure of
any knowledge domain and convert it into an ontology.


A Sample of Raw / Unedited LLM-Generated Output in JSON Format

{
  "name": "Future Shock Ontology",
  "concepts": [
    {
      "name": "Accelerating Change",
      "id": 1,
      "description": "The exponential increase in the pace of change across multiple dimensions, including technological progress, social norms, and economic systems.",
      "related_concepts": [2, 3, 5]
    },
    {
      "name": "Discontinuity",
      "id": 2,
      "description": "The frequent and abrupt shifts in social and economic systems, resulting in disorientation and confusion for individuals and societies.",
      "related_concepts": [1, 3, 5]
    },
    {
      "name": "Overload",
      "id": 3,
      "description": "The overwhelming amount of information and stimulation, leading to psychological and social problems.",
      "related_concepts": [1, 2, 5]
    },
    {
      "name": "Individualism",
      "id": 4,
      "description": "The emphasis on individualism and personalization, as people seek to cope with the pressures of change and find meaning in their lives.",
      "related_concepts": [1, 3, 6]
    },
    {
      "name": "Technological Determinism",
      "id": 5,
      "description": "Technology is seen as the primary driver of social change, with the power to shape human behavior and institutions in profound ways.",
      "related_concepts": [1, 2, 3]
    },
    {
      "name": "Social Fragmentation",
      "id": 6,
      "description": "The breakdown of traditional social institutions and the emergence of new, fluid social forms.",
      "related_concepts": [4, 5, 7]
    },
    {
      "name": "Psychological Dislocation",
      "id": 7,
      "description": "The disorienting effects of rapid change lead to psychological dislocation and social malaise, with individuals struggling to find a sense of identity and purpose.",
      "related_concepts": [2, 3, 6]
    }
  ],
  "relations": [
    {"type": "is-a", "source": 1, "target": 5},
    {"type": "is-a", "source": 2, "target": 5},
    {"type": "is-a", "source": 3, "target": 5},
    {"type": "is-a", "source": 4, "target": 6},
    {"type": "is-a", "source": 6, "target": 5},
    {"type": "causes", "source": 1, "target": 2},
    {"type": "causes", "source": 1, "target": 3},
    {"type": "causes", "source": 2, "target": 7},
    {"type": "causes", "source": 3, "target": 7},
    {"type": "leads-to", "source": 2, "target": 3}
  ]
}
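
The "test of internal consistency" named among the key
processes could start with purely structural checks on
output of this shape. The sketch below is one possible
reading, not the essay's definition of consistency: it
assumes concepts carry `id` and `related_concepts`, that
relations carry `type`, `source`, and `target`, and that
an "is-a" hierarchy should contain no cycles. The
function name `consistency_issues` is hypothetical.

```python
def consistency_issues(ontology):
    """Return a list of human-readable structural problems in an ontology dict."""
    issues = []
    ids = {c["id"] for c in ontology["concepts"]}

    # 1. Every reference must point at a concept that exists.
    for c in ontology["concepts"]:
        for ref in c.get("related_concepts", []):
            if ref not in ids:
                issues.append(f"concept {c['id']} references unknown concept {ref}")
    for r in ontology["relations"]:
        for end in ("source", "target"):
            if r[end] not in ids:
                issues.append(f"{r['type']} relation has unknown {end} {r[end]}")

    # 2. The is-a hierarchy must not contain a cycle (assumed requirement).
    parents = {}
    for r in ontology["relations"]:
        if r["type"] == "is-a":
            parents.setdefault(r["source"], set()).add(r["target"])

    def reaches(start, goal, seen):
        # Depth-first walk up the is-a edges, never revisiting a node.
        for parent in parents.get(start, ()):
            if parent == goal:
                return True
            if parent not in seen and reaches(parent, goal, seen | {parent}):
                return True
        return False

    for node in parents:
        if reaches(node, node, {node}):
            issues.append(f"is-a cycle through concept {node}")
    return issues
```

An empty list means only that the structure is coherent;
whether a relation such as "Individualism is-a Social
Fragmentation" is meaningful still requires the human
verification step.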


Future Work 


There’s a Cambrian explosion of prompts people share
online. This primordial soup of prompts appears to be
a new language, a meta-language if you will. The
prompts may come to form a protocol layer,
akin to smart contracts that run on a
blockchain. One avenue for developing synthetic
ontologies might be to crowdsource the construction
of a system of the most effective prompts.


Furthermore, the solution has so far been developed
only at a conceptual level. The approach
needs a thorough mathematical treatment. Both the
logic behind the approach and its mathematical
representation have to be properly addressed in
future work.


Currently available LLMs can handle a number of
tokens that is already sufficient to create a small
ontology piece by piece. This number is expected to
grow significantly. Soon, we will no longer have a blunt
tool with a small short-term memory that can process
only a few key phrases, but one that can process full
ontologies generated from a set of
data and prompts provided to an LLM.


Conclusion 

As data continues to be a driving force in the digital 
economy, it is critical that we develop effective tools 
for organizing and interpreting it. Synthetic ontology 
engineering may become a significant step forward in 
this regard. This approach allows for the creation of 
ontologies that are not bound by the constraints of pre- 
defined structures, and can instead be tailored to the 
specific needs of a given use case. The life sciences
are a natural place to test the hypothesis proposed in
this essay, because the field is complex and
challenging. While this is still a hypothesis that
needs to be tested and validated, it is a promising
solution for the future of knowledge management.
The more than three quarters of time now used for data
preparation could instead be spent on more valuable
activity, thus unlocking immense value for the economy.


Civilization came a long way from the first
stone axes to the first microscope. In the
same vein, the currently available LLMs represent a
relatively blunt technology. We still have to
learn to use this tool before we are able to develop a
meta-language for programming LLMs. Thus,
developing protocols to build synthetic ontologies
might be the first step toward that end.


Afterword 


We welcome researchers and organizations in the fields
of knowledge management and life sciences to
collaborate with us. Please join our Discord channel
“Connected Science” (https://discord.gg/fMa35fS2) if
you want to join a strong network of like-minded
individuals who can help you achieve your goals and
bring your projects to fruition.